Introducion

The motivation behind many RNA-Seq or transcriptomic studies is to detect the level of activity at a genomic locus, and determine if any changes are evident due to the specific biological question. Whilst we use RNA-Seq as one of the most common data types, in this context the sequence content is secondary to the abundances of genes/transcripts.

However, sequence information is still highly important as we can use this information to:

In the interests of time, we will focus primarily on quantifying gene abundances today.

Workflow summary

The basic outline of a common RNA-Seq analysis workflow is given in the following diagram. Today we’ll focus primarily on the gene-level analysis.

Today’s Data

Today we will perform the initial steps on your VM, with the later R sections performed on your local machine.

Today’s dataset can be obtained from https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4119/, and this represents a set of reads from the human Chromosome 1, with known gene expression patterns. This came from an analysis in which multiple analytic methods were investigated and the full paper is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4712774/.

Log onto your VM

  • As you will know, the username is hub, with the same password unless you have changed it
  • Enter the following commands carefully. Cut & Paste is fine, but remember you’ll need to use Ctrl+Shift+V to paste into the terminal on your VM:
bash
cd ~
mkdir -p RNA_Practical/rawData/fastq/
cd RNA_Practical/rawData/fastq/
touch getData.sh
nano getData.sh

This has created a blank script we can write to download the data, in the folder we’ll use to store the data. We have then opened the script using nano ready for the next step. Copy & paste the following code into the script once it has opened in nano.

# /bin/bash

for i in 1 3 5 7 9 11
  do
    wget https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-4119/E-MTAB-4119.raw.${i}.zip
  done

Exit using Ctrl+x then select y to save the script. Now we’ll make the script executable and run it.

chmod +x getData.sh
./getData.sh

This will being the download for the 6 files we’ll use today. The download should take approximately 30 mins, so while this is happening let’s set-up the rest of the work for the session

VM Setup

HISAT2

The aligner we’ll use today is called hisat2 and it’s an aligner which is splice aware, meaning it’s able to handle alignments which contain sequences from two separate exons. On your VM open another terminal while the downloads are completing

cd ~/Downloads
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
unzip hisat2-2.1.0-Linux_x86_64.zip
rm hisat2-2.1.0-Linux_x86_64.zip
cd
nano .bashrc

We have just installed hisat2 into your Downloads folder and we need to tell your session where to find it. We’ll now add it to your PATH.

Inside the nano editor` go to the final line, where you’ll see

export PATH=$PATH:/opt/bwa-0.7.16a:/opt/sabre

At the end of this line add :/home/hub/Downloads/hisat2-2.1.0 so that this line will look like the following

export PATH=$PATH:/opt/bwa-0.7.16a:/opt/sabre:/home/hub/Downloads/hisat2-2.1.0

Exit using Ctrl+x answering y to save the file. Now enter the following

source .bashrc
whereis hisat2

This should give you the output /home/hub/Downloads/hisat2-2.1.0/hisat2

Feature Counts

We’ll also need the software featureCounts from the Subread set of tools, and this can be installed as follows:

cd ~/Downloads
wget https://downloads.sourceforge.net/project/subread/subread-1.5.3/subread-1.5.3-Linux-x86_64.tar.gz
tar zxvf subread-1.5.3-Linux-x86_64.tar.gz
rm  subread-1.5.3-Linux-x86_64.tar.gz
cd
nano .bashrc

Once again, we’ll need to add this location to our path, so change the final line to be

export PATH=$PATH:/opt/bwa-0.7.16a:/opt/sabre:/home/hub/Downloads/hisat2-2.1.0:/home/hub/Downloads/subread-1.5.3-Linux-x86_64/bin

Exit using Ctrl+x answering y to save the file. Now enter the following

source .bashrc
whereis featureCounts

Genome Information

For this section, we need Sequence (fasta) and Gene Descriptions(gtf) files

For today, we’ll need the sequence from Chromosome 1 and a gtf which corresponds. We’ll get these from Ensembl, so cut and paste the following code directly into your VM.

cd
mkdir -p genomes/Hsapiens
cd genomes/Hsapiens
wget ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
wget ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.chr.gtf.gz
gunzip Homo_sapiens.GRCh38.90.chr.gtf.gz

The more alert will realise that this gtf contains all the information for the entire genome, so we’ll need to create one containing just the gene information for chr.

First we’ll copy the file header.

egrep "^#!" Homo_sapiens.GRCh38.90.chr.gtf > Homo_sapiens.GRCh38.90.chr1.gtf

Then we’ll just get the gene information for chr1. he first field in a gtf file is the chromosome, we this will be easy to extract using egrep

egrep "^1\s" Homo_sapiens.GRCh38.90.chr.gtf >> Homo_sapiens.GRCh38.90.chr1.gtf 

To avoid confusing ourselves, let’s delete the full gtf so we only have the one.

rm Homo_sapiens.GRCh38.90.chr.gtf

R

We already have R installed so all we need to do is install a few packages that we’ll need for today.

Open Rstudio, then Copy & Paste the following directly in the Console

source("https://bioconductor.org/biocLite.R")
biocLite(c("limma", "edgeR", "biomaRt"))

That’s everything done for the setup!!!

RNA Seq Analysis

Raw Data

By now, your data will have completed downloading so we’ll need to unzip everything. We can do this in one line so copy & paste the following

cd ~/RNA_Practical/rawData/fastq/
for f in $(ls *zip); do unzip ${f}; done
rm *zip

Here we have our 6 samples, with 3 from sampleA and another 3 from sampleB. Let’s run FastQC to make sure they’re acceptable.

FastQC

mkdir ~/RNA_Practical/rawData/FastQC
fastqc -o ~/RNA_Practical/rawData/FastQC *gz

These should all look pretty good, but we can get a few pieces of key information about these files.

How many reads are in each file?
How long are the reads?

Aligning our data

Before we align our data, we need to create an index of the genome. This basically allows the algorithm to perform the alignments quickly. This always needs to be performed by the aligner that we’ll be using, and today we’ll use hisat2.

In order to create a list of splice sites, we’ll first create a list of these using a script supplied with hisat2 based on those in the gtf file

cd ~/genomes/Hsapiens
hisat2-build Homo_sapiens.GRCh38.dna.chromosome.1.fa Homo_sapiens.GRCh38.dna.chromosome.1

This will take about 5 minutes to complete for chromosome 1, and considerably longer if we chose the entire human genome.

Once this process is complete, we’ll need to align each of our files to the indexed genome. To do this we’ll have to write a script.

First we’ll create all the directories we need:

cd ~/RNA_Practical
mkdir -p aligned/sam
mkdir -p aligned/bam
mkdir -p aligned/logs
touch runAlignments.sh
nano runAlignments.sh

This list line will open the empty script in nano for you, so cut & paste the following code into the open script.

#!/bin/bash
FQDIR=~/RNA_Practical/rawData/fastq
SAMDIR=~/RNA_Practical/aligned/sam
LOGDIR=~/RNA_Practical/aligned/logs
BAMDIR=~/RNA_Practical/aligned/bam
IDX=~/genomes/Hsapiens/Homo_sapiens.GRCh38.dna.chromosome.1

for f in ${FQDIR}/*gz
  do
    echo -e "Running alignment on ${f}"
    SAM=${SAMDIR}/$(basename ${f} _1.fq.gz).sam
    BAM=${BAMDIR}/$(basename ${f} _1.fq.gz).sorted.bam
    hisat2 --known-splicesite-infile ${z10_splicesites} \
    -x ${IDX} \
    -p 3 \
    -U ${f} \
    -S ${SAM} &> ${LOGDIR}/$(basename ${f} _1.fq.gz).log
    echo -e "done"
    echo "Converting ${SAM} to ${BAM}"
    samtools view -bh ${SAM} | samtools sort -o ${BAM}
    rm ${SAM}
  done

To exit & save the script hit Ctrl+x followed by y.

Now we need to make this script executable so enter.

chmod +x runAlignments.sh
./runAlignments.sh

These alignments will take quite a while, so while we’re waiting, go to the hisat2 page https://ccb.jhu.edu/software/hisat2/manual.shtml

Questions

  1. Explain what the following parameters meant during the call to hisat2
    1. -x ${IDX}
    2. -p 3
    3. -U ${f}
  2. How could we have specified splice sites?
  3. What does the \ symbol mean in the above script
  4. hisat2 only outputs data as a .samformat so we’ve converted this to a .bam format as part of the script. As part of this, we’ve used the pipe symbol (|).
    1. What does this second step do to our data?
    2. Why might this be an advantage
  5. Once at least one sample has finished being aligned, have a look at the log files. Can you interpret the output?
  6. Do you have any suggestions for improving the above script?

Counting Alignments

Once we’ve aligned our data, we need to count how many reads have aligned to each gene, and we’ll do this using the software featureCounts. This is the fastest part of today’s practical, especially as we sorted our bam files. Again, we’ll write a script to perform this step

First we’ll create all the directories we need:

cd ~/RNA_Practical
mkdir -p aligned/counts
touch countAlignments.sh
nano countAlignments.sh
#!/bin/bash
BAMDIR=~/RNA_Practical/aligned/bam
COUNTDIR=~/RNA_Practical/aligned/counts
GTF=~/genomes/Hsapiens/Homo_sapiens.GRCh38.90.chr1.gtf

# Find the entire list of files
BAMS=`find ${BAMDIR} -name "*sorted.bam" | tr '\n' ' '`

# Count all files in a single process
featureCounts \
  -Q 10 \
  -T 3 \
  -a ${GTF} \
  -o ${COUNTDIR}/counts.out ${BAMS}

RNA Seq Analysis using R

Loading the data into R

The above script will create a file with a single header line, followed by column names and the data. Some of the columns are not relevant for us today, so we’ll first load the file, then remove all the columns which are not required.

(If your alignments are not ready, the file can be found here)

However, first we need to load the R packages for differential gene analysis

library(readr)
library(dplyr)
library(tibble)
library(magrittr)
library(edgeR)
library(limma)
file <- "counts.out"
counts <- read_delim(file, comment = "#", delim = "\t")

Columns 2 to 6 are not particularly relevant for us today. The counts for each gene are in columns 7 & beyond, whilst the first column contains the gene IDs.

The type of object we like to use in R is known as a DGEList and we’ll need to set our data up as this object type before being able to do any meaningful analysis.

First we’ll select the columns we need, then we’ll change this into a matrix, which is what the DGEList objects need

dgeList <- counts %>%
  dplyr::select(Geneid, ends_with("bam")) %>%
  as.data.frame() %>%
  column_to_rownames("Geneid") %>%
  as.matrix() %>%
  DGEList()

Obviously these sample names are not convenient, so let’s edit these a little. In DGEList objects, the sample names are also the column names.

colnames(dgeList) <- basename(colnames(dgeList))
colnames(dgeList) <- gsub(".sorted.bam", "", colnames(dgeList))

Let’s have a look at the object

dgeList

You can see that we have the counts for every gene in a component called dgeList$counts, and information about the samples in a component called dgeList$samples.

Let’s add a column to the $samples component which defines which sample group we are in.

dgeList$samples$type <- gsub("sample([AB]).*", "\\1", rownames(dgeList$samples))
dgeList$samples$type <- as.factor(dgeList$samples$type)
dgeList$samples$group <- as.integer(dgeList$samples$type)

In the second line above, we converted the text to a categorical variable (i.e. a factor), and then we used the category number to assign the values in the group column.

Notice in the $samples component, there is also a column called lib.size. This is the total number of reads aligned to genes within each sample.

Let’s see if we have much difference between samples & groups?

library(ggplot2)
theme_set(theme_bw())
dgeList$samples %>%
  rownames_to_column("sample") %>%
  ggplot(aes(x = sample, y = lib.size / 1e6, fill = type)) +
  geom_bar(stat = "identity") +
  ylab("Library size (millions)")

In today’s data, we don’t have a huge difference, but this can vary greatly in many experiments. When analysing counts, the number of counts will clearly be affected by the total library size (i.e. the total number of reads counted as aligning to genes). We before passing this data to any statistical models, we need to calculate a scaling factor that compensates for this.

dgeList <- calcNormFactors(dgeList)

Notice that now the column norm.factorsis no longer all 1. This is used by all downstream functions in edgeR.

Data Exploration

One of the things we look for in a dataset, is that the samples from each treatment,group together when we apply dimensional reduction techniques such as Principal Component Analysis (PCA) or Multi Dimensional Scaling (MDS). The package edgeR comes with a convenient function to allow this.

plotMDS(dgeList, labels = dgeList$samples$type, col = dgeList$samples$group)

Here we can see a clear pattern of separation between the groups.

(If we don’t see this, it may mean we have some difficult decisions to make. Maybe we need to check our samples for mislabelling? Maybe there is some other experimental factor which we’re unaware of.)

An interesting way to view our data is to use the value known as Counts per million. This accounts for difference in library sizes and gives an estimate of how abundant each gene is.

head(cpm(dgeList))

You can see from this first 6 genes, that we have one highly abundant gene, and a couple with far lower expression levels.

Filtering out genes

Genes with low count numbers give us minimal statistical power to detect an changes in expression level, so a common approach is to filter out these genes. This reduces the issues which we will face due to multiple hypothesis testing, effectively increasing the statistical power of a study.

A common method would be to remove any genes which are below a specific CPM threshold in the majority of samples. In this dataset, we might like to remove genes which have \(<1\) CPM in 3 or more samples. As we have 3 samples in each group, this means we need genes to have greater than about 10 reads aligning to them, in each sample from at least one group. Any number of strategies can be applied for this stage.

We can plot the densities of each of our samples using log-transformed CPM values, and the clear peak in the range of very low expression is clearly visible.

plotDensities(cpm(dgeList, log = TRUE))

Let’s filter our dataset, and remove genes with low abundances.

genes2keep <- rowSums(cpm(dgeList) > 1) > 3

Here we’ve created a logical vector defining which genes to keep, and we have marked 3244 genes for removal, whilst marking 2019 to be retained.

summary(genes2keep)

Now let’s look at the densities after filtering.

plotDensities(cpm(dgeList, log = TRUE)[genes2keep, ])

Now we’re happy that we have the genes we can extract meaningful results from, let’s remove them from our dataset.

dgeList <- dgeList[genes2keep, ]

Calculating moderated dispersions

Most RNA-Seq analysis is performed using the Negative Binomial Distribution. This is similar to a Poisson distribution, where we count the number of times something appears in a given interval. The difference with a Negative Binomial approach is that we can model data which is more variable. Under the Poisson assumptions, the variance equals the mean, whilst under the NB this is no longer required.

This extra variability is known as the dispersion, and our estimates of dispersion will be too high for some genes, and too low for others. Taking advantage of the large numbers of genes in an experiment, we can shrink the ones that are too high, and increase the ones that are too small. This is an important step in RNA Seq analysis, as it reduces the number of false positives, and increase the numbers of true positives.

Before we do this, we need to define our statistical model. Here, our term of interest is in the column type.

design <- model.matrix(~0 + type, data = dgeList$samples)
design
dgeList <- estimateDisp(dgeList, design = design)

Performing differential expression analysis

The most common analytic approach we use is an Exact Test, proposed by Robinson and Smyth. This is analogous to Fisher’s Exact Test, as once again we are dealing with count data, not Normally distributed data.

in the following line of code, we are comparing the first two groups in our data. Clearly we only have two groups in this dataset,but it is quite common to have multiple groups in other analyses.

etResults <- exactTest(dgeList, pair = 1:2)$table 

As we’ve discovered, we need to account for our multiple testing considerations. Under the null hypothesis, we would expect the distribution of \(p\)-values to be approximately uniform on the interval [0, 1] . However, under the the alternative hypothesis, we should see a spike of \(p\)-values near zero. A “healthy” dataset will likely have a mixture of the two distributions, so we should see a flat distribution, with a cluster of values near zero.

hist(etResults[,"PValue"], breaks= 50)

This gives us confidence that we will be able to detect DE genes in this dataset. Now, we can proceed to estimate the False Discovery Rate in this data.

etResults <- etResults %>%
  rownames_to_column("Geneid") %>%
  mutate(FDR = p.adjust(PValue)) %>%
  arrange(PValue) %>%
  as_data_frame()

The highest ranked gene in this dataset is ENSG00000117472. It’s been given a negative value, which means it’s down-regulated in the second of the two conditions. Let’s check the raw counts.

cpm(dgeList, log= TRUE)["ENSG00000117472",]

Inspect a few more of the highly ranked genes, so make sure you can understand these results.

Let’s get a list of significant genes, by using an FDR of 0.01, and logFC \(>1\). (logFC is fold-change on the \(\log_2\) scale, so that logFC=1, is a two-fold increase, logFC=2 is a 4-fold increase,and logFC=-1 is a halving of expression)

sigGenes <- filter(etResults, FDR< 0.01, abs(logFC) > 1)$Geneid

Now we can visualise the pattern of logFC to expression level.

plotMD(dgeList, status = rownames(dgeList)%in%sigGenes)

An alternative is to plot the logFC in relation to the \(p\)-value, to make what is known as a volcano plot.

etResults %>%
  mutate(DE = Geneid %in% sigGenes) %>%
  ggplot(aes(x = logFC, y = -log10(PValue),colour = DE)) +
  geom_point() +
  scale_colour_manual(values = c("grey50", "red"))

Downstream analysis

From here, the next steps would be to explore any enrichment of GO terms, KEGG terms, functional pathways, or any other pre-defined gene groupings. An easy way to do this would be to convert our gene identifiers to EntrezGene identifiers, then use the functions goana() or kegga() in the limma package.

To do this we’d need to first convert our IDs using the package biomaRt

library(biomaRt)
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
attr2get <- c("ensembl_gene_id", "entrezgene")
ens2entrez <- getBM(attributes = attr2get, filters = "ensembl_gene_id", values = rownames(dgeList), mart = mart)
head(ens2entrez)
goRes <- goana(de = filter(ens2entrez, ensembl_gene_id %in% sigGenes)$entrezgene, 
      universe = ens2entrez$entrezgene )

To obtain a quick list of the results:

arrange(goRes, P.DE) %>% head